Hyperparameters

We have finished implementing the LLaMA 2 transformer architecture, but so far we have only used relatively small values for the parameters in ModelArgs. In this chapter we will look at the specific values LLaMA 2 used during training and pick concrete values of our own.

pip -q install transformers datasets einops pytorch_lightning wandb
import torch
import torch.nn as nn
from transformers import AutoTokenizer
from datasets import load_dataset
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
from einops import rearrange # einstein operation
from huggingface_hub import notebook_login
notebook_login()

Now that we are actually training the model, it is time to use the LLaMA 2 tokenizer. To use it, we need to log in to a Hugging Face account and accept the LLaMA 2 license terms.

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print("Vocab size with the LLaMA 2 tokenizer:", tokenizer.vocab_size)
Vocab size with the LLaMA 2 tokenizer: 32000

With the LLaMA 2 tokenizer, our vocab size is 32,000.

class ModelArgs:
    def __init__(self):

        self.norm_eps = 1e-5

        self.initializer_range = 0.02
        self.learning_rate = None # experiment below
        self.n_epochs = 100

        self.rotary_dim = 64 # (self.n_embd // self.n_head) // 2: rotate only half of each head

        # divided by 8
        self.n_head = 4 #32
        self.n_embd = 512 #4096
        self.max_sequence_len = 256 #2048
        self.multiple_of = 32 #256
        
        self.n_layer = 4 #32

        self.batch_size = 32 #32
        self.vocab_size = 32000

args = ModelArgs()

After experimenting and repeatedly hitting "out of memory" errors, I decided to cut the variables that affect the total parameter count down to 1/8 of the original values used in LLaMA 2. That leaves only the learning rate to experiment with, to find the value that works best.
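
To see what the 1/8 scaling buys us, here is a quick sanity check: a small pure-Python parameter counter for the scaled-down config above (the helper names `count_params` and `swiglu_hidden_dim` are mine, not part of the model code). It reproduces the 45.5 M trainable parameters and 181.845 MB reported in the Lightning model summary further down.

```python
def swiglu_hidden_dim(n_embd, multiple_of):
    # Same arithmetic as the FeedForward class: 4x expansion, 2/3 scaling, round up
    hidden = int(2 * (4 * n_embd) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def count_params(n_layer=4, n_head=4, n_embd=512, vocab_size=32000, multiple_of=32):
    total = vocab_size * n_embd                  # token embedding table
    hidden = swiglu_hidden_dim(n_embd, multiple_of)
    per_block = (
        2 * n_embd                               # two RMSNorm weights
        + n_embd * 3 * n_embd + 3 * n_embd       # Wqkv (weight + bias)
        + n_embd * n_embd + n_embd               # out_proj (weight + bias)
        + 3 * n_embd * hidden                    # w1, w2, w3 (bias=False)
    )
    total += n_layer * per_block
    total += n_embd + n_embd * vocab_size + vocab_size  # final norm + LM head
    return total

print(count_params())                 # 45461248, i.e. the 45.5 M in the Lightning summary
print(count_params() * 4 / 1e6)       # 181.844992 MB at fp32, matching "181.845"
```

Note that the embedding table and the LM head alone are about 32.8 M of the 45.5 M parameters; with the full LLaMA 2 values every term above grows roughly 64x, which is why the 1/8 reduction was necessary.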

For rotary_dim, I borrowed from Microsoft's phi-1.5. I noticed that they rotate only half of the head size, i.e. (n_embd // n_head) // 2.
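
Concretely, with our scaled config the head size is 512 // 4 = 128, and half of that is 64. By coincidence, the original 7B config gives the same head size (4096 // 32 = 128), so the half-rotation dimension is 64 either way:

```python
# Scaled-down config
head_dim = 512 // 4          # 128
print(head_dim // 2)         # 64

# Original LLaMA 2 7B config happens to give the same head size
head_dim_full = 4096 // 32   # 128
print(head_dim_full // 2)    # 64
```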

sample_train = 960
sample_val = int(.1 * sample_train)

dataset = load_dataset("roneneldan/TinyStories")
tokenizer.pad_token = tokenizer.eos_token

subset_trainset = dataset['train'][:sample_train]['text']
subset_valset = dataset['validation'][:sample_val]['text']

tokenized_trainset = tokenizer(
    subset_trainset,
    return_tensors='pt',
    padding='max_length',  # Pad sequences to the max_seq_length
    truncation=True,  # Truncate sequences if they exceed max_seq_length
    max_length=args.max_sequence_len  # Set the maximum sequence length
)

tokenized_valset = tokenizer(
    subset_valset,
    return_tensors='pt',
    padding='max_length',  # Pad sequences to the max_seq_length
    truncation=True,  # Truncate sequences if they exceed max_seq_length
    max_length=args.max_sequence_len  # Set the maximum sequence length
)

train_data = tokenized_trainset['input_ids']
val_data = tokenized_valset['input_ids']
train_data.shape, val_data.shape
Repo card metadata block was not found. Setting CardData to empty.
(torch.Size([960, 256]), torch.Size([96, 256]))

Since we have reached the training stage, we now use both a train set and a validation set.

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

import multiprocessing

cpu_count = multiprocessing.cpu_count()
print(f"Number of CPU cores: {cpu_count}")

custom_trainset = CustomDataset(train_data)
train_loader = DataLoader(custom_trainset, batch_size=args.batch_size, shuffle=True, num_workers=cpu_count)

custom_valset = CustomDataset(val_data)
val_loader = DataLoader(custom_valset, batch_size=args.batch_size, num_workers=cpu_count)

# input_ids = next(iter(train_loader))
# input_ids.shape
Number of CPU cores: 2

DataLoader can use multiple CPU cores, so I check how many cores the machine has and pass that number as num_workers, which helps the DataLoader process data more efficiently.

class Embedding(nn.Module):
    def __init__(self, args:ModelArgs):
        super().__init__()

        self.wte = nn.Embedding(args.vocab_size, args.n_embd)

    def forward(self, input_ids):
        input_shape = input_ids.shape[-1]
        input_ids = input_ids.view(-1, input_shape)

        input_ids_embd = self.wte(input_ids)

        return input_ids_embd

# embd = Embedding(args)
# input_ids_embd = embd(input_ids)
# input_ids_embd.shape

To keep memory usage down, I no longer print the output of each class while experimenting. Instead, I turned the lines that display each class's output into comments, to show that nothing has changed from the previous chapter.

class RotaryEmbedding(nn.Module):
    def __init__(self, args:ModelArgs, base = 10000):
        super().__init__()
        self.rotary_dim = args.rotary_dim

        inv_freq = 1.0 / (base ** (torch.arange(0, self.rotary_dim, 2) / self.rotary_dim ))
        self.register_buffer("inv_freq", inv_freq)

        self.cos_cache = None
        self.sin_cache = None

    def forward(self, qkv):
        seqlen = qkv.shape[1]

        # Update cos sin cache
        t = torch.arange(seqlen, device = qkv.device)
        freqs = torch.outer(t, self.inv_freq)

        self.cos_cache = torch.cos(freqs)
        self.sin_cache = torch.sin(freqs)

        # Apply rotary qkv
        rotary_dim = self.cos_cache.shape[1]
        rotary_dim *= 2

        q_rot = qkv[:, :, 0, :, :rotary_dim]
        q_pass = qkv[:, :, 0, :, rotary_dim:]

        k_rot = qkv[:, :, 1, :, :rotary_dim]
        k_pass = qkv[:, :, 1, :, rotary_dim:]

        # Splits the queries and keys in half
        q1, q2 = q_rot.chunk(2, dim=-1)
        k1, k2 = k_rot.chunk(2, dim=-1)
        c, s = rearrange(self.cos_cache, "t d -> t 1 d"), rearrange(self.sin_cache, "t d -> t 1 d")

        # Computes the new keys and queries
        q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], dim=-1)
        k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], dim=-1)

        return torch.cat(
            [
                torch.cat([q_rot, q_pass], dim=-1).unsqueeze(2),
                torch.cat([k_rot, k_pass], dim=-1).unsqueeze(2),
                qkv[:, :, 2:3, :, :]
            ],
            dim=2
        )
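
To build intuition for the class above, here is a standalone pure-Python sketch (my own illustration, not code the model calls) of what one (cos, sin) pair does to a single (q1, q2) dimension pair: it rotates the 2-D point by an angle proportional to the token position, so the vector's norm is preserved and only its phase encodes position.

```python
import math

base, rotary_dim = 10000, 64
# Same frequencies as inv_freq above: base^(-2i / rotary_dim) for each dimension pair i
inv_freq = [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

def rotate_pair(q1, q2, pos, i):
    """Rotate the 2-D point (q1, q2) by the position-dependent angle pos * inv_freq[i]."""
    theta = pos * inv_freq[i]
    c, s = math.cos(theta), math.sin(theta)
    return q1 * c - q2 * s, q1 * s + q2 * c

r1, r2 = rotate_pair(1.0, 0.0, pos=5, i=0)
# Rotation changes the angle, never the length
print(round(math.hypot(r1, r2), 6))  # 1.0
```

Because queries and keys are rotated by their own positions, the dot product between them ends up depending only on their relative distance, which is the point of rotary embeddings.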
class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight

# attn_norm = RMSNorm(args.n_embd)
# input_ids_embd_norm = attn_norm(input_ids_embd)
# input_ids_embd_norm.shape
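
RMSNorm boils down to one line of arithmetic: divide the vector by its root mean square (plus eps), so the output has RMS ≈ 1. A minimal pure-Python sketch (illustration only; the module above additionally learns a per-dimension weight):

```python
import math

def rms_norm(x, eps=1e-6):
    """x * 1/sqrt(mean(x^2) + eps): rescales x to (roughly) unit root-mean-square."""
    ms = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]

out = rms_norm([3.0, -4.0])                          # mean square = 12.5
rms = math.sqrt(sum(v * v for v in out) / len(out))
print(round(rms, 4))                                 # 1.0
```

Unlike LayerNorm, there is no mean subtraction and no bias, which makes it slightly cheaper while working just as well in practice.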
class Attention(nn.Module):
    def __init__(self, args:ModelArgs):
        super().__init__()

        self.rotary_emb = RotaryEmbedding(args)

        self.head_dim = args.n_embd // args.n_head
        opt_size = args.n_head * self.head_dim
        hidden_size = args.n_embd

        self.Wqkv = nn.Linear(hidden_size, 3 * opt_size)
        self.out_proj = nn.Linear(opt_size, hidden_size)

    def forward(self, input_ids_embd_norm):
        qkv = self.Wqkv(input_ids_embd_norm)
        qkv = rearrange(qkv, 'b t (three h d) -> b t three h d', three=3, d=self.head_dim)

        # Rotary Query & Key
        qkv = self.rotary_emb(qkv)
        q, k, v = qkv.unbind(2)

        output = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        output = rearrange(output, "... h d -> ... (h d)")

        attn_out = self.out_proj(output)

        return attn_out

# # Normalize
# attn_norm = RMSNorm(args.n_embd)
# x_embd_norm = attn_norm(input_ids_embd)

# attn = Attention(args)
# attn_out = attn(input_ids_embd_norm)
# # add residual
# attn_out += input_ids_embd
# attn_out.shape
class FeedForward(nn.Module):
    def __init__(self, args:ModelArgs):
        super().__init__()
        hidden_dim = 4 * args.n_embd
        hidden_dim = int(2 * hidden_dim / 3)
        hidden_dim = args.multiple_of * ((hidden_dim + args.multiple_of - 1) // args.multiple_of)

        self.w1 = nn.Linear(args.n_embd, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, args.n_embd, bias=False)
        self.w3 = nn.Linear(args.n_embd, hidden_dim, bias=False)

        self.act = nn.SiLU()

    def forward(self, attn_out_norm):
        # SwiGLU: apply SiLU to the w1 branch, then gate with the w3 branch
        hidden_states = self.act(self.w1(attn_out_norm)) * self.w3(attn_out_norm)
        ffwd_out = self.w2(hidden_states)

        return ffwd_out

# # Normalize
# ffwd_norm = RMSNorm(args.n_embd)
# attn_out_norm = ffwd_norm(attn_out)

# ffwd = FeedForward(args)
# ffwd_out = ffwd(attn_out_norm)
# # add residual
# ffwd_out += attn_out
# ffwd_out.shape
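
The three hidden_dim lines in FeedForward are worth tracing by hand. Here is a small sketch (the helper name is mine) that reproduces the computation; with our scaled values it gives 1376, and with the original LLaMA 2 7B values (n_embd=4096, multiple_of=256) it gives 11008, the FFN size of the released 7B model:

```python
def llama_ffn_hidden(n_embd, multiple_of):
    hidden = 4 * n_embd                 # classic 4x FFN expansion
    hidden = int(2 * hidden / 3)        # scale by 2/3, since SwiGLU uses three matrices instead of two
    # round UP to the nearest multiple of multiple_of
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(llama_ffn_hidden(512, 32))     # 1376, our scaled-down config
print(llama_ffn_hidden(4096, 256))   # 11008, the LLaMA 2 7B FFN size
```

The 2/3 factor keeps the total FFN parameter count close to a standard 2-matrix FFN even though SwiGLU has three weight matrices.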
class TransfomerBlock(nn.Module):
    def __init__(self, args:ModelArgs):
        super().__init__()

        self.attention_norm = RMSNorm(args.n_embd, args.norm_eps)
        self.ffwd_norm = RMSNorm(args.n_embd, args.norm_eps)

        self.attn = Attention(args)
        self.ffwd = FeedForward(args)

    def forward(self, input_ids_embd):

        attn_out = input_ids_embd + self.attn(self.attention_norm(input_ids_embd))

        ffwd_out = attn_out + self.ffwd(self.ffwd_norm(attn_out))

        return ffwd_out

# t_block = TransfomerBlock(args)
# ffwd_out = t_block(input_ids_embd)
# ffwd_out.shape
class TransformerHead(nn.Module):
    def __init__(self, args:ModelArgs):
        super().__init__()

        self.norm = RMSNorm(args.n_embd, args.norm_eps)
        self.linear = nn.Linear(args.n_embd, args.vocab_size)

    def forward(self, ffwd_out):
        h = self.norm(ffwd_out)
        logits = self.linear(h)

        return logits

# t_head = TransformerHead(args)
# logits = t_head(ffwd_out)
# logits.shape
class TransformerSequential(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.initializer_range = args.initializer_range

        modules = [Embedding(args)]
        modules += [TransfomerBlock(args) for _ in range(args.n_layer)]
        modules.append(TransformerHead(args))

        self.layers = nn.Sequential(*modules)
        self.apply(self._init_weights)

    def forward(self, input_ids):
        return self.layers(input_ids)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.initializer_range)

# model = TransformerSequential(args)
# logits = model(input_ids)
# logits.shape
class TransformerLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, logits, labels, shift_labels = True):
        if shift_labels:
            logits = logits[..., :-1, :].contiguous()
            labels = labels[..., 1:].contiguous()

        logits = logits.view(-1, logits.shape[-1])
        labels = labels.view(-1)

        loss = self.loss_fct(logits, labels)

        return loss

# t_loss = TransformerLoss()
# loss = t_loss(logits, input_ids)
# loss
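
The shift_labels branch implements next-token prediction: the logit at position t is scored against the token at position t+1, so we drop the last logit and the first label. A tiny illustration with strings in place of token ids:

```python
tokens = ["Once", "upon", "a", "time"]

# logits[..., :-1, :] pairs with labels[..., 1:]:
contexts = tokens[:-1]   # positions where a prediction is made
targets  = tokens[1:]    # the token each position should predict

for ctx, target in zip(contexts, targets):
    print(f"{ctx!r} -> {target!r}")
# 'Once' -> 'upon'
# 'upon' -> 'a'
# 'a' -> 'time'
```

Flattening to (batch*seq, vocab) and (batch*seq,) afterwards is just what nn.CrossEntropyLoss expects.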
import pytorch_lightning as pl
import wandb

class ModelForVisualization(pl.LightningModule):
    def __init__(self, args, lr):
        super().__init__()
        self.learning_rate = lr

        self.model = TransformerSequential(args)
        self.t_loss = TransformerLoss()

    def forward(self, input_ids):
        return self.model(input_ids)

    def training_step(self, batch):
        input_ids = batch
        logits = self(input_ids)
        loss = self.t_loss(logits, input_ids)

        wandb.log({"train loss": loss})

        return loss

    def validation_step(self, batch):
        input_ids = batch
        logits = self(input_ids)
        loss = self.t_loss(logits, input_ids)

        wandb.log({"valid loss": loss})

        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
            self.model.parameters(),
            lr=self.learning_rate,
        )

        return optimizer

I use the pytorch_lightning library because it provides a progress bar showing training status. I also use Weights & Biases (wandb) to track the loss curves online, which makes the training process much more convenient.

Below is the code I used to experiment with the learning rate. Because the sweep is very time-consuming (it ran for more than 8 hours), I left it commented out and only show the code for the best learning rate I found. Here is an overview of the process.
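
The sweep grid comes from np.linspace(0.001, 0.0001, 10), and the accumulated floating-point error of the steps is why some wandb run names below carry long tails like run_lr_0.0007000000000000001. A plain-Python reproduction of the grid (no numpy needed):

```python
start, stop, num = 0.001, 0.0001, 10
step = (stop - start) / (num - 1)            # -0.0001 (up to float rounding)
lrs = [start + i * step for i in range(num)]

print([round(lr, 6) for lr in lrs])
# [0.001, 0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001]
```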

[Figure: Train Loss across the learning-rate runs]

And that brings us to the end of this chapter. In the next one, I will train the model on a much larger amount of data and clean up the code a bit.

wandb.login()

args = ModelArgs()

# learning_rates_to_try = [0.1, 0.01, 0.001, 0.0005]
# learning_rates_to_try = np.linspace(0.001, 0.0001, 10)

# for lr in learning_rates_to_try:
#     name = f"run_lr_{lr}"
#     wandb.init(project="llama2", config=args, name=name)
#     model = ModelForVisualization(args, lr)

#     trainer = pl.Trainer(max_epochs=args.n_epochs)
#     trainer.fit(model, train_loader, val_loader)

best_learning_rate = 0.0003
model = ModelForVisualization(args, best_learning_rate)
trainer = pl.Trainer(max_epochs=6)
trainer.fit(model, train_loader, val_loader)
wandb: Currently logged in as: nvtai0452. Use `wandb login --relogin` to force relogin
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
WARNING:pytorch_lightning.loggers.tensorboard:Missing logger folder: /content/lightning_logs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name   | Type                  | Params
-------------------------------------------------
0 | model  | TransformerSequential | 45.5 M
1 | t_loss | TransformerLoss       | 0     
-------------------------------------------------
45.5 M    Trainable params
0         Non-trainable params
45.5 M    Total params
181.845   Total estimated model params size (MB)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (30) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=100` reached.
Tracking runs with wandb version 0.15.12; run data is saved locally under /content/wandb/. Final summaries from the sweep (in every run the train-loss chart falls steadily while the valid loss bottoms out after a few epochs and then climbs, i.e. the model overfits the small subset):

run_lr_0.001                  (zt6vny14): train loss 1.36344, valid loss 3.93152
run_lr_0.0009                 (snqu74s0): train loss 1.20064, valid loss 3.94179
run_lr_0.0008                 (rvx2dzd5): train loss 1.32851, valid loss 3.9453
run_lr_0.0007000000000000001  (21qmtq46): train loss 1.20635, valid loss 3.98119
run_lr_0.0006000000000000001  (mate1gtw): train loss 1.29609, valid loss 3.83442
run_lr_0.0005                 (ace73pkp): train loss 1.31785, valid loss 3.92018
run_lr_0.0004000000000000001  (lbe3t0fz): train loss 1.31858, valid loss 3.84997
run_lr_0.00030000000000000014 (kd22o4kb): train loss 1.25272, valid loss 3.97291

View the runs at: https://wandb.ai/nvtai0452/llama2
Tracking run with wandb version 0.15.12
Run data is saved locally in /content/wandb/run-20231017_171928-tpz23mt0